Classifying Structured Web Sources using Aggressive Feature Selection
نویسندگان
چکیده
This paper studies the problem of classifying structured data sources on the Web. While prior works use all features, once extracted from search interfaces, we further refine the feature set. In our research, each search interface is treated simply as a bag-of-words. We choose a subset of words, which is suited to classify web sources, by our feature selection methods with new metrics and a novel simple ranking scheme. Using aggressive feature selection approach, together with a Gaussian process classifier, we obtained high classification performance in an evaluation over real web data.
منابع مشابه
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
متن کاملA Framework for Extracting, Classifying, Analyzing, and Presenting Information from Semi-Structured Web Data Sources
Extracting information from the web data sources becomes very important because the massive and increasing amount of diverse semi-structured information sources in the Internet that are available to users, and the variety of web pages making the process of information extraction from web a challenging problem. This paper proposes a framework for extracting, classifying, analyzing, and presentin...
متن کاملMaking use of category structure for multi-class classification
Multi-class classification is the task of organizing data samples into multiple predefined categories. In this thesis, we address two different research problems of multi-class classification, one specific and the other general. The first and specific problem is to categorize structured data sources on the Web. While prior works use all features, once extracted from search interfaces, we furthe...
متن کاملOptimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines
In this paper, principles and existing feature selection methods for classifying and clustering data be introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search based procedures as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...
متن کاملOptimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines
In this paper, principles and existing feature selection methods for classifying and clustering data be introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search based procedures as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2009